Abstract

We introduce a new compression scheme for labeled trees based on top trees [3]. Our compression
scheme is the first to simultaneously take advantage of internal repeats in the tree (as
opposed to the classical DAG compression that only exploits rooted subtree repeats) while also
supporting fast navigational queries directly on the compressed representation. We show that 
the new compression scheme achieves close to optimal worst-case compression, can compress 
exponentially better than DAG compression, is never much worse than DAG compression, and 
supports navigational queries in logarithmic time. 

1 Introduction 

A labeled tree T is a rooted ordered tree with n nodes where each node has a label from an alphabet
Σ. Many classical applications of trees in computer science (such as tries, dictionaries, parse trees,
suffix trees, and XML databases) generate navigational queries on labeled trees (e.g., returning the
label of node v, the parent of v, the depth of v, the size of v's subtree, etc.). In this paper we
present a new and simple compression scheme that supports such queries directly on the compressed
representation.

While a huge literature exists on string compression, labeled tree compression is much less 
studied. The simplest way to compress a tree is to serialize it using, say, preorder traversal to get 
a string of labels to which string compression can be applied. This approach is fast and is used in 
practice, but it does not support the various navigational queries. Furthermore, it does not capture 
possible repeats contained in the tree structure. 

To get a sublinear space representation for trees with many repeated substructures (such as XML 
databases), one needs to define "repeated substructures" and devise an algorithm that identifies 
such repeats and collapses them (like Lempel-Ziv does to strings). There have been two main 
ways to define repeats: subtree repeats and the more general tree pattern repeats (see Fig. 1). A
subtree repeat is an identical (both in structure and in labels) occurrence of a rooted subtree in T.
A tree pattern repeat is an identical (both in structure and in labels) occurrence of any connected
subgraph of T. Subtree repeats are used in DAG compression [7, 13] and tree pattern repeats in
tree grammars [8, 9, 18-20]. In this paper we introduce top tree compression, based on top trees [3],
which exploits tree pattern repeats. Compared to the existing techniques our compression scheme
has the following advantages. Let T be a tree of size n with nodes labeled from an alphabet of
size σ. We support navigational queries in O(log n) time (a similar result is not known for tree
grammars), the compression ratio is in the worst case at least $\log_\sigma^{0.19} n$ (no such result is known for
either DAG compression or tree grammars), our scheme can compress exponentially better than
DAG compression, and it is never worse than DAG compression by more than a log n factor.

Figure 1: A tree T with a subtree repeat T' (left), and a tree pattern repeat T' (right).



1.1 Previous Work 

Using subtree repeats, a node in the tree T that has a child with subtree T' can instead point to any
other occurrence of T'. This way, it is possible to represent T as a directed acyclic graph (DAG).
Over all possible DAGs that can represent T, the smallest one is unique and can be computed in
O(n) time [11]. Its size can be exponentially smaller than n. Using subtree repeats for compression
was studied in [7, 13], and a Lempel-Ziv analog of subtree repeats was suggested in [1]. It is also
possible to support navigational queries [6] and path queries [7] directly on the DAG representation
in logarithmic time.

The problem with subtree repeats is that we can miss many internal repeats. Consider for 
example the case where T is a single path of n nodes with the same label. Even though T is highly 
compressible (we can represent it by just storing the label and the length of the path) it does not 
contain a single subtree repeat and its minimal DAG is of size n. 
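
The path example can be checked with a small sketch of DAG compression by subtree sharing. The array-based tree encoding and the function names below are assumptions made for illustration, and the sketch counts DAG nodes only (not edges); it is not the algorithm of any cited paper.

```python
def minimal_dag_size(labels, children):
    """Count distinct rooted subtrees, i.e. the nodes of the minimal DAG.

    labels[v] is the label of node v, children[v] its ordered child list,
    and node 0 is the root (an encoding assumed for this sketch).
    """
    # Iterative DFS; every node appears after its parent in `order`.
    order, stack = [], [0]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])

    ids = {}                      # (label, child ids) -> id of the shared subtree
    node_id = [0] * len(labels)
    for v in reversed(order):     # children are processed before their parent
        key = (labels[v], tuple(node_id[c] for c in children[v]))
        node_id[v] = ids.setdefault(key, len(ids))
    return len(ids)


def path(n, label="a"):
    """A single path of n nodes, all with the same label."""
    return [label] * n, [[i + 1] if i + 1 < n else [] for i in range(n)]


def complete_binary_tree(n, label="a"):
    """Heap-ordered complete binary tree on n nodes, all with the same label."""
    kids = [[c for c in (2 * i + 1, 2 * i + 2) if c < n] for i in range(n)]
    return [label] * n, kids
```

On a path every node roots a subtree of a different size, so nothing is shared, while a complete binary tree has only one distinct subtree per level.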

Alternatively, tree grammars are capable of exploiting tree pattern repeats. Tree grammars
generalize grammars from deriving strings to deriving trees and were studied in [8, 9, 18-20]. Compared
to DAG compression, a tree grammar can be exponentially smaller than the minimal DAG [18].
Unfortunately, computing a minimal tree grammar is NP-hard [10], and all known tree grammar based
compression schemes can only support navigational queries in time proportional to the height of
the grammar, which can be Ω(n).



1.2 Our Results. 

We propose a new compression scheme for labeled trees, which we call top tree compression. To
the best of our knowledge, this is the first compression scheme for trees that (i) takes advantage of
tree pattern repeats (like tree grammars) but (ii) simultaneously supports navigational queries on
the compressed representation in logarithmic time (like DAG compression). In the worst case, we
show that (iii) the compression ratio of top tree compression is always at least $\log_\sigma^{0.19} n$ (compared
to the information-theoretic lower bound of $\log_\sigma n$). This is in contrast to both tree grammars and
DAG compression, which do not have good provable worst-case compression performance. Finally,
we compare the performance of top tree compression to DAG compression. We show that top tree
compression (iv) can compress exponentially better than DAG compression, and (v) is never much
worse than DAG compression.

With these features, top tree compression significantly improves the state-of-the-art for tree 
compression. Specifically, it is the first scheme to simultaneously achieve (i) and (ii) and the first 
scheme based on either subtree repeats or tree pattern repeats with provable good compression 
performance compared to worst-case (iii) or the DAG (iv). 

The key idea in top tree compression is to transform the input tree T into another tree $\mathcal{T}$
such that tree pattern repeats in T become subtree repeats in $\mathcal{T}$. The transformation is based
on top trees [2-4], a data structure originally designed for dynamic (uncompressed) trees. After
the transformation, we compress the new tree $\mathcal{T}$ using the classical DAG compression, resulting in
the top DAG $\mathcal{TD}$. The top DAG $\mathcal{TD}$ forms the basis for our compression scheme. We obtain our
bounds on compression (iii), (iv), and (v) by analyzing the size of $\mathcal{TD}$, and we obtain efficient
navigational queries (ii) by augmenting $\mathcal{TD}$ with additional data structures.

To state our bounds, let $n_G$ denote the total size (vertices plus edges) of the graph G. We first
show the following worst-case compression bound achieved by the top DAG.

Theorem 1 Let T be any ordered tree with nodes labeled from an alphabet of size σ and let $\mathcal{TD}$ be
the corresponding top DAG. Then, $n_{\mathcal{TD}} = O(n_T / \log_\sigma^{0.19} n_T)$.

This worst-case performance of the top DAG should be compared to the optimal
information-theoretic lower bound of $\Omega(n_T / \log_\sigma n_T)$. Note that with standard DAG compression the worst-case
bound is $O(n_T)$ since a single path is incompressible using subtree repeats.

Secondly, we compare top DAG compression to standard DAG compression.

Theorem 2 Let T be any ordered tree and let D and $\mathcal{TD}$ be the corresponding DAG and top DAG,
respectively. For any tree T we have $n_{\mathcal{TD}} = O(\log n_T) \cdot n_D$, and there exist families of trees T such
that $n_D = \Omega(n_T / \log n_T) \cdot n_{\mathcal{TD}}$.

Thus, top DAG compression can be exponentially better than DAG compression and it is always 
within a logarithmic factor of DAG compression. To the best of our knowledge this is the first 
non-trivial bound shown for any tree compression scheme compared to the DAG. 

Finally, we show how to represent the top DAG $\mathcal{TD}$ in $O(n_{\mathcal{TD}})$ space such that we can quickly
answer a wide range of queries about T without decompressing.

Theorem 3 Let T be an ordered tree with top DAG $\mathcal{TD}$. There is an $O(n_{\mathcal{TD}})$ space representation
of T that supports Access, Depth, Height, Size, Parent, Firstchild, NextSibling, LevelAncestor, and
NCA in $O(\log n_T)$ time. Furthermore, we can Decompress a subtree T' of T in time $O(\log n_T + |T'|)$.

The Access, Depth, Height, Size, Parent, Firstchild, and NextSibling all take a node v in T as input 
and return its label, its depth, its height, the size of its subtree, its parent, its first child, and its 
immediate sibling, respectively. The LevelAncestor returns an ancestor at a specified distance from 
v, and NCA returns the nearest common ancestor to a given pair of nodes. Finally, the Decompress 
operation decompresses and returns any rooted subtree. 



1.3 Related work (Succinct data structures) 



Jacobson [16] was the first to observe that the naive pointer-based tree representation using
O(n log n) bits is wasteful. He showed that unlabeled trees can be represented using 2n + o(n)
bits and support various queries by inspection of O(lg n) bits in the bit probe model. This space
bound is asymptotically optimal with respect to the information-theoretic lower bound averaged over all trees.
Munro and Raman [21] showed how to achieve the same bound in the RAM model while using
only constant time for queries. Such representations are called succinct data structures, and have
been generalized to trees with higher degrees [5] and to a richer set of queries such as subtree-size
queries [21] and level-ancestor queries [15]. For labeled trees, Ferragina et al. [12] gave a
representation using 2n log σ + O(n) bits that supports basic navigational operations, such as finding the parent
of node v, the i'th child of v, and any child of v with label α.

All the above bounds for space are averaged over all trees and do not take advantage of the cases 
where the input tree contains many repeated substructures. The focus of this paper is achieving 
sublinear bounds in trees with many repeated substructures (i.e., highly compressible trees). 

2 Top Trees and Top DAGs 

Top trees were introduced by Alstrup et al. [2-4] for maintaining an uncompressed, unordered, and
unlabeled tree under link and cut operations. We extend them to ordered and labeled trees, and
then introduce top DAGs for compression. Our construction is related to well-known algorithms
for top tree construction, but modified for our purposes. In particular, we need to carefully order
the steps of the construction to guarantee efficient compression, and we disallow some combinations
of cluster merges to ensure fast navigation.

2.1 Clusters 

Let v be a node in T with children $v_1, \ldots, v_k$ in left-to-right order. Define T(v) to be the subtree
induced by v and all proper descendants of v. Define F(v) to be the forest induced by all proper
descendants of v. For $1 \le s \le r \le k$ let $T(v, v_s, v_r)$ be the tree pattern induced by the nodes
$\{v\} \cup T(v_s) \cup T(v_{s+1}) \cup \cdots \cup T(v_r)$.

A cluster with top boundary node v is a tree pattern of the form $T(v, v_s, v_r)$, $1 \le s \le r \le k$.
A cluster with top boundary node v and bottom boundary node u is a tree pattern of the form
$T(v, v_s, v_r) \setminus F(u)$, $1 \le s \le r \le k$, where u is a node in $T(v_s) \cup \cdots \cup T(v_r)$. Clusters can therefore
have either one or two boundary nodes. For example, let p(v) denote the parent of v; then a single
edge (v, p(v)) of T is a cluster where p(v) is the top boundary node. If v is a leaf then there is no
bottom boundary node, otherwise v is a bottom boundary node.

Two edge disjoint clusters A and B whose vertices overlap on a single boundary node can be 
merged if their union C = A ∪ B is also a cluster. There are five ways of merging clusters, as
illustrated by Fig. 2. The original paper on top trees [2-4] contains more ways to merge clusters,
but allowing these would lead to a violation of our definition of clusters as a tree pattern of the
form $T(v, v_s, v_r) \setminus F(u)$, which we need for navigational purposes.




Figure 2: Five ways of merging clusters. The • nodes are boundary nodes that remain boundary
nodes in the merged cluster. The ○ nodes are boundary nodes that become internal (non-boundary)
nodes in the merged cluster.

2.2 Top Trees 

A top tree $\mathcal{T}$ of T is a hierarchical decomposition of T into clusters. It is an ordered, rooted, and
binary tree and is defined as follows.

• The nodes of $\mathcal{T}$ correspond to clusters of T.

• The root of $\mathcal{T}$ is the cluster T itself.

• The leaves of $\mathcal{T}$ correspond to the edges of T. The label of each leaf is the pair of labels of
the endpoints of the corresponding edge in T.

• Each internal node of $\mathcal{T}$ is a merged cluster of its two children. The label of each internal
node is the type of merge it represents (out of the five merging options). The children are
ordered so that the left child is the child cluster visited first in a preorder traversal of T.



2.3 Constructing the Top Tree 

We now describe a greedy algorithm for constructing a top tree $\mathcal{T}$ of T that has height $O(\log n_T)$.
The algorithm constructs the top tree $\mathcal{T}$ bottom-up in $O(\log n_T)$ iterations starting with the edges
of T as the leaves of $\mathcal{T}$. During the construction, we maintain an auxiliary tree $\widetilde{T}$ initialized as
$\widetilde{T} := T$. The edges of $\widetilde{T}$ will correspond to the nodes of $\mathcal{T}$ and to the clusters of T. In the beginning,
these clusters represent actual edges (v, p(v)) of T. In this case, if v is not a leaf in T then v is the
bottom boundary node of the cluster and p(v) is the top boundary node. If v is a leaf then there
is no bottom boundary node.

In each one of the $O(\log n_T)$ iterations, a constant fraction of $\widetilde{T}$'s edges (i.e., clusters of T) are
merged. Each merge is performed on two overlapping edges (u, v) and (v, w) of $\widetilde{T}$ using one of the
five types of merges from Fig. 2. If v is the parent of u and the only child of w then a merge of
type (a) or (b) contracts these edges in $\widetilde{T}$ into the edge (u, w). If v is the parent of both u and w,
and w or u is a leaf, then a merge of type (c), (d), or (e) replaces these edges in $\widetilde{T}$ with either the
edge (u, v) or (v, w). In all cases, we create a new node in $\mathcal{T}$ whose two children are the clusters
corresponding to (u, v) and to (v, w).



This way, we get that a single iteration shrinks the tree $\widetilde{T}$ (and the number of parentless nodes
in $\mathcal{T}$) by a constant factor. The process ends when $\widetilde{T}$ is a single edge. Each iteration is performed
as follows:

Step 1: Horizontal Merges. For each node $v \in \widetilde{T}$ with $k \ge 2$ children $v_1, \ldots, v_k$, for i = 1 to
$\lfloor k/2 \rfloor$, merge the edges $(v, v_{2i-1})$ and $(v, v_{2i})$ if $v_{2i-1}$ or $v_{2i}$ is a leaf. If k is odd and $v_k$ is a leaf and
both $v_{k-2}$ and $v_{k-1}$ are non-leaves then also merge $(v, v_{k-1})$ and $(v, v_k)$.

Step 2: Vertical Merges. For each maximal path $v_1, \ldots, v_p$ of nodes in $\widetilde{T}$ such that $v_{i+1}$ is
the parent of $v_i$ and $v_2, \ldots, v_{p-1}$ have a single child: If p is even merge the following pairs of
edges $\{(v_1, v_2), (v_2, v_3)\}, \{(v_3, v_4), (v_4, v_5)\}, \ldots, \{(v_{p-3}, v_{p-2}), (v_{p-2}, v_{p-1})\}$. If p is odd merge the following pairs
of edges $\{(v_1, v_2), (v_2, v_3)\}, \{(v_3, v_4), (v_4, v_5)\}, \ldots, \{(v_{p-4}, v_{p-3}), (v_{p-3}, v_{p-2})\}$, and if $(v_{p-1}, v_p)$ was not merged in
Step 1 then also merge $\{(v_{p-2}, v_{p-1}), (v_{p-1}, v_p)\}$.
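
The pairing rules of the two steps can be written as index computations. The following is a minimal sketch that only selects which edges are paired; it does not build clusters or update the auxiliary tree, and the function names and 1-based index conventions are assumptions made for illustration.

```python
def horizontal_pairs(is_leaf):
    """Step 1 for one node: is_leaf[i-1] tells whether child v_i is a leaf.

    Returns pairs (a, b) of 1-based child indices, meaning the edges
    (v, v_a) and (v, v_b) are merged.
    """
    k = len(is_leaf)

    def leaf(i):
        return is_leaf[i - 1]

    pairs = [(2 * i - 1, 2 * i) for i in range(1, k // 2 + 1)
             if leaf(2 * i - 1) or leaf(2 * i)]
    # Extra rule for odd k: the last child is a leaf and the pair before it
    # was not merged because both of its children are non-leaves.
    if k % 2 == 1 and k >= 3 and leaf(k) and not leaf(k - 2) and not leaf(k - 1):
        pairs.append((k - 1, k))
    return pairs


def vertical_pairs(p, top_edge_merged_in_step1=False):
    """Step 2 for one maximal path v_1, ..., v_p (v_{i+1} is the parent of v_i).

    Edge e_i denotes (v_i, v_{i+1}); a returned pair (i, i + 1) means that
    e_i and e_{i+1} are merged.
    """
    last = p - 3 if p % 2 == 0 else p - 4      # first edge of the last regular pair
    pairs = [(i, i + 1) for i in range(1, last + 1, 2)]
    if p % 2 == 1 and p >= 3 and not top_edge_merged_in_step1:
        pairs.append((p - 2, p - 1))
    return pairs
```

For example, `horizontal_pairs([False, False, True])` pairs the second and third child edges, and `vertical_pairs(5)` pairs edges (1, 2) and (3, 4).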

Lemma 1 A single iteration shrinks $\widetilde{T}$ by a factor of $c \ge 8/7$.

Proof. Suppose that in the beginning of the iteration the tree $\widetilde{T}$ has n nodes. Any tree with n
nodes has at least n/2 nodes with less than 2 children. Consider the edges $(v_i, p(v_i))$ of $\widetilde{T}$ where $v_i$
has one or no children. We show that at least half of these n/2 edges are merged in this iteration.
This will imply that n/4 edges of $\widetilde{T}$ are replaced with n/8 edges and so the size of $\widetilde{T}$ shrinks to
7n/8. To prove it, we charge each edge $(v_i, p(v_i))$ that is not merged to a unique edge $f(v_i, p(v_i))$
that is merged.

Case 1. Suppose that $v_i$ has no children (i.e., is a leaf). If $v_i$ has at least one sibling and
$(v_i, p(v_i))$ is not merged it is because $v_i$ has no right sibling and its left sibling $v_{i-1}$ has already
been merged (i.e., we have just merged $(v_{i-2}, p(v_{i-2}))$ and $(v_{i-1}, p(v_{i-1}))$ in Step 1 where $p(v_i) =
p(v_{i-1}) = p(v_{i-2})$). We also know that at least one of $v_{i-1}$ and $v_{i-2}$ must be a leaf. We set
$f(v_i, p(v_i)) = (v_{i-1}, p(v_{i-1}))$ if $v_{i-1}$ is a leaf, otherwise we set $f(v_i, p(v_i)) = (v_{i-2}, p(v_{i-2}))$.

Case 2. Suppose that $v_i$ has no children (i.e., is a leaf) and no siblings (i.e., $p(v_i)$ has only
one child). The only reason for not merging $(v_i, p(v_i))$ with $(p(v_i), p(p(v_i)))$ in Step 2 is because
$(p(v_i), p(p(v_i)))$ was just merged in Step 1. In this case, we set $f(v_i, p(v_i)) = (p(v_i), p(p(v_i)))$. Notice
that we haven't already charged $(p(v_i), p(p(v_i)))$ in Case 1 because $p(v_i)$ is not a leaf.

Case 3. Suppose that $v_i$ has exactly one child $c(v_i)$ and that $(v_i, p(v_i))$ was not merged in
Step 1. The only reason for not merging $(v_i, p(v_i))$ with $(c(v_i), v_i)$ in Step 2 is if $c(v_i)$ has
only one child $c(c(v_i))$ and we just merged $(c(v_i), v_i)$ with $(c(c(v_i)), c(v_i))$. In this case, we set
$f(v_i, p(v_i)) = (c(v_i), v_i)$. Notice that we haven't already charged $(c(v_i), v_i)$ in Case 1 because $c(v_i)$
is not a leaf. We also haven't charged $(c(v_i), v_i)$ in Case 2 because $v_i$ has only one child. □



Corollary 1 Given a tree T, the greedy top tree construction creates a top tree of size $O(n_T)$ and
height $O(\log n_T)$ in $O(n_T)$ time.
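
To see what the height bound means numerically, the sketch below turns the shrink factor 8/7 from Lemma 1 into an explicit iteration count. The function names are hypothetical, and the simulation simply applies the worst-case factor in every round rather than running the actual construction.

```python
import math

SHRINK = 8 / 7  # per-iteration shrink factor guaranteed by Lemma 1


def max_iterations(size):
    """Upper bound log_{8/7}(size) on the number of iterations."""
    if size <= 1:
        return 0
    return math.ceil(math.log(size) / math.log(SHRINK))


def simulate_worst_case(size):
    """Count rounds when every round shrinks by exactly the factor 8/7."""
    rounds = 0
    while size > 1:
        size = max(1, math.floor(size * 7 / 8))
        rounds += 1
    return rounds
```

Since $1 / \log_2(8/7) \approx 5.19$, the bound amounts to roughly $5.19 \log_2 n_T$ iterations, which is the constant hidden in the $O(\log n_T)$ height.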

The next lemma follows from the construction of the top tree and Lemma 1.



Lemma 2 For any node c in the top tree corresponding to a cluster C of T, the total size of all
clusters corresponding to nodes in the subtree $\mathcal{T}(c)$ is $O(|C|)$.

2.4 Top DAGs

The top DAG of T, denoted $\mathcal{TD}$, is the minimal DAG representation of the top tree $\mathcal{T}$. It can be
computed in $O(n_T)$ time from $\mathcal{T}$ using the algorithm of [11]. The entire top DAG construction can
thus be done in $O(n_T)$ time.
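
As an illustration of why the top DAG can be small, here is a simplified sketch for the special case where T is a single path with one repeated label. On a path only vertical merges occur, so each round pairs consecutive clusters from the bottom up. The sketch assumes that type (a) denotes a vertical merge whose lower cluster has a bottom boundary node and type (b) one whose lower cluster ends in the leaf; this naming, the function name, and the tuple encoding are assumptions, not taken from the text.

```python
def top_dag_nodes_of_path(num_edges, label="a"):
    """Number of nodes in the top DAG of a uniformly labeled path (sketch)."""
    ids = {}  # hash-consing table: structurally equal top-tree nodes share one id

    def intern(key):
        return ids.setdefault(key, len(ids))

    leaf = intern(("edge", label, label))
    # Clusters listed bottom to top as (node id, has a bottom boundary node).
    # Only the lowest edge, which ends in the leaf of the path, has none.
    clusters = [(leaf, i > 0) for i in range(num_edges)]
    while len(clusters) > 1:
        merged = []
        for i in range(0, len(clusters) - 1, 2):
            (lower, lower_has_bottom), (upper, _) = clusters[i], clusters[i + 1]
            kind = "a" if lower_has_bottom else "b"   # assumed type naming
            # The upper cluster comes first in preorder, so it is the left child.
            merged.append((intern((kind, upper, lower)), lower_has_bottom))
        if len(clusters) % 2 == 1:
            merged.append(clusters[-1])               # topmost cluster waits a round
        clusters = merged
    return len(ids)
```

Each round contributes at most a constant number of new distinct nodes, so the count grows logarithmically in the path length, whereas the plain DAG of the same path has one node per tree node.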

3 Compression Analysis 

3.1 Worst-case Bounds for Top DAG Compression

We now prove Theorem 1.

Let $\mathcal{T}$ be the top tree for T. Identical clusters in T are represented by identical complete
subtrees in $\mathcal{T}$. Since identical subtrees in $\mathcal{T}$ are shared in $\mathcal{TD}$ we have the following lemma.

Lemma 3 For any tree T, all clusters in the corresponding top DAG $\mathcal{TD}$ are distinct.

Lemma 4 Let T be any tree with $n_T$ nodes labeled from an alphabet of size σ and let $\mathcal{T}$ be its top
tree. The nodes of $\mathcal{T}$ correspond to at most $O(n_T / \log_\sigma^{0.19} n_T)$ distinct clusters in T.

Proof. Consider the bottom-up construction of the top tree $\mathcal{T}$ starting with the leaves of $\mathcal{T}$ (the
clusters corresponding to the edges of T). By Lemma 1 each level in the top tree reduces the
number of clusters by a factor c = 8/7, while at most doubling the size of the current clusters.
After round i we are therefore left with at most $O(n_T / c^i)$ clusters, each of size at most $2^i + 1$.

To bound the total number of distinct clusters, we partition the clusters into small clusters and
large clusters. The small clusters are those created in rounds 1 to $j = \log_2(0.5 \log_{4\sigma} n_T)$ and the
large clusters are those created in the remaining rounds from j + 1 to h. The total number of large
clusters is at most

$$\sum_{i=j+1}^{h} O(n_T / c^i) = O(n_T / c^{j+1}) = O(n_T / \log_\sigma^{0.19} n_T).$$

In particular, there are at most $O(n_T / \log_\sigma^{0.19} n_T)$ distinct clusters among these.

Next, we bound the total number of distinct small clusters. Each small cluster corresponds to
a connected subgraph (i.e., a tree pattern) of T that is of size at most $2^j + 1$ and is an ordered and
labeled tree. The total number of distinct ordered and labeled trees of size at most x is given by

$$\sum_{i=1}^{x} \sigma^i C_{i-1} = \sum_{i=1}^{x} \frac{\sigma^i}{i} \binom{2i-2}{i-1} = \sum_{i=1}^{x} O(4^i \sigma^i) = O\big((4\sigma)^{x+1}\big),$$

where $C_i$ denotes the i-th Catalan number. Hence, the total number of distinct small clusters
is bounded by $O((4\sigma)^{2^j + 2}) = O(\sigma^2 \sqrt{n_T}) = O(n_T^{3/4})$. In the last equality we used the fact that
$\sigma < n_T^{1/8}$. If $\sigma \ge n_T^{1/8}$ then the lemma trivially holds because $O(n_T / \log_\sigma^{0.19} n_T) = O(n_T)$. We get
that the total number of distinct clusters is at most $O(n_T / \log_\sigma^{0.19} n_T + n_T^{3/4}) = O(n_T / \log_\sigma^{0.19} n_T)$.

□
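
The constants in the proof can be checked numerically. The sketch below, with hypothetical helper names, shows that $c^j$ with c = 8/7 and $j = \log_2(0.5 \log_{4\sigma} n_T)$ equals $(0.5 \log_{4\sigma} n_T)^{\log_2 c}$ with $\log_2(8/7) \approx 0.19$, and evaluates the count of ordered labeled trees used for the small clusters.

```python
import math

C = 8 / 7
EXPONENT = math.log2(C)  # roughly 0.1926, the exponent of the logarithm in the bound


def catalan(n):
    """n-th Catalan number."""
    return math.comb(2 * n, n) // (n + 1)


def labeled_ordered_trees_up_to(x, sigma):
    """Number of ordered trees with at most x nodes and labels from sigma symbols."""
    return sum(sigma ** i * catalan(i - 1) for i in range(1, x + 1))


def small_round_cutoff(n, sigma):
    """The round index j = log2(0.5 * log_{4 sigma} n) from the proof."""
    return math.log2(0.5 * math.log(n, 4 * sigma))
```

The identity $c^{\log_2 y} = y^{\log_2 c}$ is what converts the geometric decay in the number of clusters into a power of the logarithm.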

Combining Lemmas 3 and 4 we obtain Theorem 1.



3.2 Comparison to Subtree Sharing 

We now prove Theorem 2. To do so we first show two useful properties of top trees and top DAGs.

Let T be a tree with top tree $\mathcal{T}$. For any internal node z in T, we say that the subtree T(z) is
represented by a set of clusters $\{C_1, \ldots, C_\ell\}$ from $\mathcal{T}$ if $T(z) = C_1 \cup \cdots \cup C_\ell$. Since each edge in T
is a cluster in $\mathcal{T}$ we can always trivially represent T(z) by at most |T(z)| − 1 clusters. We prove
that there always exists a set of clusters, denoted $S_z$, of size $O(\log n_T)$ that represents T(z).

Let z be any internal node in T and let $z_1$ be its leftmost child. Since z is internal we have that
z is the top boundary node of the leaf cluster $L = (z, z_1)$ in $\mathcal{T}$. Let U be the smallest cluster in $\mathcal{T}$
containing all nodes of T(z). We have that L is a descendant leaf of U in $\mathcal{T}$. Consider the path P
of clusters in $\mathcal{T}$ from U to L. An off-path cluster of P is a cluster C that is not on P, but whose
parent cluster is on P. We define

$$S_z = \{C \mid C \text{ is an off-path cluster of } P \text{ and } C \subseteq T(z)\} \cup \{L\}.$$

Since the length of P is $O(\log n_T)$ the number of clusters in $S_z$ is $O(\log n_T)$. We need to prove
that $\bigcup_{C \in S_z} C = T(z)$. By definition we have that all nodes in $\bigcup_{C \in S_z} C$ are in T(z). For the other
direction, we first prove the following lemma. Let E(C) denote the set of edges of a cluster C.

Lemma 5 Let C be an off-path cluster of P. Then either $E(C) \subseteq E(T(z))$ or $E(C) \cap E(T(z)) = \emptyset$.

Proof. We will show that any cluster in $\mathcal{T}$ containing edges from both T(z) and $T \setminus T(z)$ contains
both (p(z), z) and $(z, z_1)$, where $z_1$ is the leftmost child of z and p(z) is the parent of z. Let C be a
cluster containing edges from both T(z) and $T \setminus T(z)$. Consider the subtree $\mathcal{T}(C)$ and let C' be the
smallest cluster in it containing edges from both T(z) and $T \setminus T(z)$. Then C' must be a merge of type
(a) or (b), where the higher cluster A only contains edges from $T \setminus T(z)$ and the bottom cluster B
only contains edges from T(z). Also, z is the top boundary node of B and the bottom boundary
node of A. Clearly, A contains the edge (p(z), z), since all clusters are connected tree patterns. A
merge of type (a) or (b) is only possible when B contains all children of its top boundary node.
Thus B contains the edge $(z, z_1)$.

We have $L = (z, z_1)$ and therefore all clusters in $\mathcal{T}$ containing $(z, z_1)$ lie on the path from L
to the root. The path P is a subpath of this path, and thus no off-path cluster of P can contain
$(z, z_1)$. Therefore no off-path cluster of P can contain edges from both T(z) and $T \setminus T(z)$. □

Any edge from T(z) (except $(z, z_1)$) contained in a cluster on P must be contained in an off-path
cluster of P. Lemma 5 therefore implies that $T(z) = \bigcup_{C \in S_z} C$ and the following corollary.

Corollary 2 Let T be a tree with top tree $\mathcal{T}$. For any node z in T, the subtree T(z) can be
represented by a set of $O(\log n_T)$ clusters in $\mathcal{T}$.

Next we prove that our bottom-up top tree construction guarantees that two identical subtrees
T(z), T(z') are represented by two identical sets of clusters $S_z$, $S_{z'}$. Two sets of clusters are identical
(denoted $S_z = S_{z'}$) when $C \in S_z$ iff $C' \in S_{z'}$ such that C and C' are clusters corresponding to tree
patterns in T that have the same structure and labels.

Lemma 6 Let T be a tree with top tree $\mathcal{T}$. Let T(z) and T(z') be identical subtrees in T and let
$S_z$ and $S_{z'}$ be the corresponding representing sets of clusters in $\mathcal{T}$. Then, $S_z = S_{z'}$.



Proof. Consider the tree $\widetilde{T}$ at some iteration of the construction of the top tree. We will say
that an edge e in $\widetilde{T}$ belongs to T(z) (resp. T(z')) if the cluster corresponding to e only contains
edges from T(z) (resp. T(z')) in the original tree. Let $L_z$ be the cluster in $\widetilde{T}$ containing the edge
$L = (z, z_1)$, where $z_1$ is the leftmost child of z. Define $L_{z'}$ similarly.

Recall that U is the smallest cluster in $\mathcal{T}$ containing all nodes of T(z) and that P is the path
of clusters in $\mathcal{T}$ from U to L. By definition, all clusters on the path P contain L. This implies
that new off-path clusters are only constructed when $L_z$ (resp. $L_{z'}$) is merged. Merges of identical
edges belonging to T(z) and T(z') are the same in the two subtrees of $\widetilde{T}$, since we merge first
horizontally, and then vertically bottom-up. By the same argument, if $L_z$ is merged with an edge
belonging to T(z) then $L_{z'}$ is merged with the corresponding edge from T(z'). For a merge of an
edge belonging to T(z) (resp. T(z')) with an edge not belonging to T(z) (resp. T(z')), one of the
edges must be $L_z$ (resp. $L_{z'}$). If $L_z$ is merged in this iteration but $L_{z'}$ is not, then $L_z$ is merged
with an edge not belonging to T(z) (and vice versa). Thus, after the iteration all edges belonging
to T(z) in $\widetilde{T}$ are identical to the edges belonging to T(z') in $\widetilde{T}$.

New off-path clusters are only constructed when $L_z$ (resp. $L_{z'}$) is merged. Such a merge only creates new
clusters in $S_z$ (resp. $S_{z'}$) if it is a merge with an edge belonging to T(z) (resp. T(z')). Since
these merges are identical for the two subtrees in each iteration, and $L_z$ is merged with an edge
belonging to T(z) iff $L_{z'}$ is merged with the corresponding edge belonging to T(z'), we have $S_z = S_{z'}$. □



Theorem 4 For any tree T, $n_{\mathcal{TD}} = O(\log n_T) \cdot n_D$.

Proof. An edge is shared in the DAG if it is in a shared subtree of T. We denote the edges in the
DAG D that are shared as red edges, and the edges that are not shared as blue. Let $r_D$ and $b_D$ be
the number of red and blue edges in the DAG D, respectively.

A cluster in the top tree $\mathcal{T}$ is red if it only contains red edges from D, blue if it only contains
blue edges from D, and purple if it contains both. Since clusters are connected subtrees we have
the property that if cluster C is red (resp. blue), then all clusters in the subtree $\mathcal{T}(C)$ are red (resp.
blue). Let r, b, and p be the number of red, blue, and purple clusters in $\mathcal{T}$, respectively. Since $\mathcal{T}$
is a binary tree, where all internal nodes have 2 children and all leaves are either red or blue, we
have $p \le r + b$. It is thus enough to bound the number of red and blue clusters.

First we bound the number of red clusters in the top DAG $\mathcal{TD}$. Consider a shared subtree T(z)
from the DAG compression. T(z) is represented by at most $O(\log n_T)$ clusters in $\mathcal{T}$, and all these
contain only edges from T(z). It follows from Lemma 6 that all the clusters representing T(z) (and
their subtrees in $\mathcal{T}$) are identical for all copies of T(z). Therefore each of these will appear only
once in the top DAG $\mathcal{TD}$.

From Corollary 1 we have that each of the clusters representing T(z) has size at most $O(|T(z)|)$.
Thus, the total size of the subtrees of the clusters representing T(z) is $O(|T(z)| \log n_T)$. This is
true for all shared subtrees, and thus $r = O(r_D \log n_T)$.

To bound the number of blue clusters in the top DAG, we first note that the blue clusters
form rooted subtrees in the top tree. Let C be the root of such a blue subtree in $\mathcal{T}$. Then C
is a connected component of blue edges in T. It follows from Corollary 1 that $|\mathcal{T}(C)| = O(|C|)$.
Thus the number of blue clusters is $b = O(b_D)$. The number of edges in $\mathcal{TD}$ is thus $b + r + p \le
2(b + r) = O(b_D + r_D \log n_T) = O(n_D \log n_T)$. □

Figure 3: A top DAG $\mathcal{TD}$ and a DAG D(T) of (a) a path and (b) a complete binary tree. On
a path (and also a caterpillar and a star) the size of $\mathcal{TD}$ is $O(\log n_T)$ whereas the size of D(T) is
$O(n_T)$. On a complete binary tree (b) both $\mathcal{TD}$ and D(T) are of size $O(\log n_T)$.



Lemma 7 There exist trees T such that $n_D = \Omega(n_T / \log n_T) \cdot n_{\mathcal{TD}}$.

Proof. Caterpillars and paths have $n_{\mathcal{TD}} = O(\log n_T)$, whereas $n_D = n_T$ (see Figure 3). □



4 Supporting Navigational Queries 

In this section we prove Theorem 3. Let T be a tree with top DAG $\mathcal{TD}$. To uniquely identify nodes
of T we refer to them by their preorder numbers. For a node of T with preorder number x we want
to support the following queries.

Access(x): Return the label associated with node x. 

Decompress(x): Return the tree T(x). 

Parent(x): Return the parent of node x. 

Depth(x): Return the depth of node x.

Height(x): Return the height of node x. 

Size(x): Return the number of nodes in T(x). 

Firstchild(x): Return the first child of x. 

NextSibling(x): Return the sibling immediately to the right of x. 

LevelAncestor(x, i): Return the ancestor of x whose distance from x is i.

NCA(x, y): Return the nearest common ancestor of the nodes x and y. 

4.1 The Data Structure

In order to enable the above queries, we augment the top DAG $\mathcal{TD}$ of T with some additional
information. Consider a cluster C in $\mathcal{TD}$. Recall that if C is a leaf in $\mathcal{TD}$ then C is a single edge
in T and C stores the labels of this edge's endpoints. Otherwise, C is a cluster of T obtained by
merging two clusters: the cluster A corresponding to C's left child and the cluster B corresponding
to C's right child. Consider the sequence of nodes in a preorder traversal of C. Let r(A) denote
the rightmost node in the sequence that is also a node of A. Let ℓ(B) and r(B) denote the leftmost
and rightmost nodes in the sequence that are also in B. We augment each cluster C with:

• The integers r(A), ℓ(B), and r(B).

• The type of merge that was applied to A and B to form C.

• The height and size of C (i.e., of the tree pattern C in T).

• The distance from the top boundary node of C to the top boundary nodes of A and B.

Since we use constant space for each cluster of $\mathcal{TD}$, the total space remains $O(n_{\mathcal{TD}})$.

Local preorder numbers. All of our queries are based on traversals of the augmented top DAG
$\mathcal{TD}$. During the traversal we identify nodes by computing preorder numbers local to the cluster
that we are currently visiting. Specifically, let u be a node in the cluster C. Define the local preorder
number of u, denoted $u_C$, to be the position of u in a preorder traversal of C. The following lemma
states that in O(1) time we can compute $u_A$ and $u_B$ from $u_C$ and vice versa.

Lemma 8 Let c be an internal node of $\mathcal{TD}$ that corresponds to the cluster C of T obtained by
merging the cluster A (corresponding to c's left child) and the cluster B (corresponding to c's right
child). For any node u in C, given $u_C$ we can tell in constant time if u is in A (and obtain $u_A$), in
B (and obtain $u_B$), or in both. Similarly, if u is in A or in B we can obtain $u_C$ in constant time
from $u_A$ or $u_B$.

Proof. If C is a merge of A and B of type (a) or (b) then

• $u_C = 1$ iff u is the top boundary node of A and C and $u_A = 1$.

• $u_C \in [2, \ell(B) - 1]$ iff u is an internal node of A and $u_A = u_C$.

• $u_C = \ell(B)$ iff u is the shared boundary node of A and B, $u_A = \ell(B)$, and $u_B = 1$.

• $u_C \in [\ell(B) + 1, r(B)]$ iff u is an internal node in B and $u_B = u_C - \ell(B) + 1$.

• $u_C \in [r(B) + 1, r(A)]$ iff u is an internal node in A and $u_A = u_C - r(B) + \ell(B)$.

Otherwise, if C is a merge of A and B of type (c), (d), or (e) then

• $u_C = 1$ iff u is the shared boundary node of A, B, and C and $u_A = u_B = 1$.

• $u_C \in [2, r(A)]$ iff u is an internal node in A and $u_A = u_C$.

• $u_C \in [r(A) + 1, r(B)]$ iff u is an internal node in B and $u_B = u_C - r(A) + 1$.
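
The case analysis above translates directly into constant-time conversions between local preorder numbers. The sketch below is one possible rendering; the function names and the dictionary return value are assumptions made for illustration, with `rA`, `lB`, and `rB` standing for the stored integers r(A), ℓ(B), and r(B).

```python
def split_local(kind, rA, lB, rB, uC):
    """Map a local preorder number in C to the child cluster(s) containing it.

    kind is the merge type "a".."e". Returns a dict with key "A" and/or "B"
    holding the local preorder number in that child.
    """
    if kind in ("a", "b"):                    # vertical merge
        if uC == lB:
            return {"A": lB, "B": 1}          # shared boundary node
        if uC < lB:
            return {"A": uC}                  # part of A before B in preorder
        if uC <= rB:
            return {"B": uC - lB + 1}
        return {"A": uC - rB + lB}            # part of A after B in preorder
    if uC == 1:                               # horizontal merge (c), (d), (e)
        return {"A": 1, "B": 1}               # shared top boundary node
    if uC <= rA:
        return {"A": uC}
    return {"B": uC - rA + 1}


def join_local(kind, rA, lB, rB, side, u):
    """Inverse direction: local number u in child `side` to the number in C."""
    if kind in ("a", "b"):
        if side == "A":
            return u if u <= lB else u + rB - lB
        return u + lB - 1
    if side == "A" or u == 1:
        return u
    return u + rA - 1
```

For instance, take a cluster C with preorder r, m, b, s, where A consists of r with children m and s, and B is the edge (m, b). Then r(A) = 4, ℓ(B) = 2, r(B) = 3, and node s has $u_C = 4$ and $u_A = 4 - 3 + 2 = 3$.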

4.2 Implementation of the procedures

We now show how to implement the queries using local preorder numbers in top-down and bottom- 
up tr aver sals of TT>. 

4.2.1 Access and Depth 

The queries Access(x) and Depth(x) ask for the label and depth of the node whose preorder number 
in T is x. They are both performed by a single top-down search of 1~T> starting from its root and 
ending with the leaf cluster containing x. Since the depth of TT> is O(lognj-) the total time is 
0(logn T ). 

Access. At each cluster C on the top-down search we compute the local preorder number $x_C$.
Initially, the root cluster corresponds to the entire T so we set $x_T = x$. Let C be a cluster on the
way. If C is a leaf cluster we return the label of the top boundary node if $x_C = 1$ and the label of
the single internal node if $x_C = 2$. If on the other hand C is an internal cluster with child clusters
A and B, we continue the search in the child cluster containing $x_C$. We compute the new local
preorder number according to Lemma 8. If $x_C$ is the shared boundary node between A and B we
continue the search in either A or B.

Depth The only difference between Depth(x) and Access(x) is that during the top-down search
we also sum the distances between the top boundary nodes of the visited clusters. Let d be this
distance. At the leaf cluster at the end of the search we return d if $x_C = 1$ and d + 1 if $x_C = 2$.
Since the distances are stored the total time remains $O(\log n_T)$.
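
The top-down search for Access and Depth can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the classes `Leaf` and `Merge`, the fields `l_b` and `dist`, and the collapsing of the five merge types into a single `vertical` flag are hypothetical, and the top tree is built by hand for a four-node tree rather than by the construction algorithm.

```python
class Leaf:
    """Leaf cluster: one edge, storing the labels of its two endpoints."""
    def __init__(self, top_label, bottom_label):
        self.labels = (top_label, bottom_label)
        self.size = 2


class Merge:
    """Internal cluster C with left child A and right child B.

    Assumptions: vertical=True stands for merge types (a)/(b), False for
    (c)/(d)/(e). For vertical merges, l_b is the local preorder number of
    the shared boundary node and dist is its distance from the top
    boundary node of A.
    """
    def __init__(self, a, b, vertical, l_b=1, dist=0):
        self.a, self.b, self.vertical = a, b, vertical
        self.l_b, self.dist = l_b, dist
        self.size = a.size + b.size - 1      # one node is shared


def access_and_depth(root, x):
    """Return (label, depth) of the node with preorder number x in T."""
    c, d = root, 0                  # d: depth of the current top boundary node
    while isinstance(c, Merge):
        if c.vertical:
            r_b = c.l_b + c.b.size - 1
            if c.l_b <= x <= r_b:   # x lies in B (shared node included)
                c, x, d = c.b, x - c.l_b + 1, d + c.dist
            else:                   # x lies only in A
                if x > r_b:
                    x = x - r_b + c.l_b
                c = c.a
        elif x <= c.a.size:         # horizontal merge, x lies in A
            c = c.a
        else:                       # horizontal merge, x lies only in B
            c, x = c.b, x - c.a.size + 1
    return c.labels[x - 1], d + (x - 1)


# Tree r(a(c), b) with preorder numbers r=1, a=2, c=3, b=4.
side = Merge(Leaf("r", "a"), Leaf("r", "b"), vertical=False)    # nodes r, a, b
top = Merge(side, Leaf("a", "c"), vertical=True, l_b=2, dist=1)
print([access_and_depth(top, x) for x in range(1, 5)])
```

The sketch sends the shared boundary node of a vertical merge to B; by the description of Access, either child would do.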

4.2.2 Firstchild, Level Ancestor, Parent, and NCA 

We answer these queries by a top-down search to find the local preorder number in a relevant 
cluster C, and then a bottom-up search to compute the corresponding preorder number in T. 

Firstchild We compute Firstchild(x) in two steps. 

Step 1: Top-down Search. We do a top-down search to find the first cluster with top boundary
node x. We use local preorder numbers as in the algorithm for Access. Let C be a cluster in the
search. If $x_C = 1$ we stop the search. Otherwise we know that $x_C > 1$. If C is a leaf cluster we
stop and report that x does not have a first child since it is a leaf in T. If on the other hand C is an
internal cluster with child clusters A and B, we continue the search in the child cluster containing
$x_C$. If $x_C$ is the shared boundary node between A and B we always continue the search in B. This
ensures that we continue to the cluster containing the children of x (recall that B is the deeper
cluster in merges of type (a) and (b)). Combined with the condition that we stop the search in the
first cluster C where x is the top boundary node, this implies that all children of x are in C.

Step 2: Bottom-up Search. Let C be the cluster found in Step 1. Since all children of x are
in C, the node with local preorder number 2 in C is the first child of x. We do a bottom-up search
from C to the root cluster to compute the preorder number in T of this node.
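
The two steps of Firstchild can be sketched as below. This is an illustrative sketch only: the classes `Leaf` and `Merge`, the helpers `down` and `up` (the two directions of Lemma 8), and the `vertical` flag standing for merge types (a)/(b) are assumptions introduced for the example, and the top tree is hand-built.

```python
class Leaf:
    """Leaf cluster: a single edge."""
    size = 2


class Merge:
    """Internal cluster; vertical=True models types (a)/(b) (assumption).
    l_b: local preorder number of the shared boundary node (vertical only)."""
    def __init__(self, a, b, vertical, l_b=1):
        self.a, self.b, self.vertical, self.l_b = a, b, vertical, l_b
        self.size = a.size + b.size - 1


def down(c, x):
    """Top-down step: return (side, number in that child) for number x in c.
    The shared boundary node of a vertical merge is always sent to B."""
    if c.vertical:
        r_b = c.l_b + c.b.size - 1
        if x < c.l_b:
            return "A", x
        if x <= r_b:
            return "B", x - c.l_b + 1
        return "A", x - r_b + c.l_b
    if x <= c.a.size:
        return "A", x
    return "B", x - c.a.size + 1


def up(c, side, y):
    """Bottom-up step: map number y in child `side` to the number in c."""
    if c.vertical:
        if side == "A":
            return y if y <= c.l_b else y + c.b.size - 1
        return y + c.l_b - 1
    if side == "A" or y == 1:
        return y
    return y + c.a.size - 1


def first_child(root, x):
    """Preorder number of the first child of x, or None if x is a leaf."""
    c, path = root, []
    while x != 1:                       # Step 1: stop once x is the top boundary
        if isinstance(c, Leaf):
            return None                 # x is a leaf of T
        side, x = down(c, x)
        path.append((c, side))
        c = c.a if side == "A" else c.b
    y = 2                               # first child has local number 2 in c
    for parent, side in reversed(path): # Step 2: translate back to T
        y = up(parent, side, y)
    return y


# Tree r(a(c), b) with preorder numbers r=1, a=2, c=3, b=4.
top = Merge(Merge(Leaf(), Leaf(), vertical=False), Leaf(), vertical=True, l_b=2)
print([first_child(top, x) for x in range(1, 5)])
```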

Level Ancestor and Parent Notice that Parent(x) can be computed as LevelAncestor(x, 1).
Since LevelAncestor(x, 0) = x we focus on LevelAncestor(x, i) for $i \geq 1$. This is done in three steps:

Step 1: Compute Depth. Compute the depth of LevelAncestor(x, i) as d = Depth(x) - i.

Step 2: Top-down Search. We do a top-down search to find the cluster with top boundary
node y of depth d such that x is a descendant of y. During the search we maintain the depth of the
current top boundary node as in the algorithm for Depth. At each cluster C in the search we also
compute a local preorder number $x'_C$ to guide the search. The idea is that $x'_C$ either corresponds
to x or to an ancestor of x within C. Initially, for the root cluster T we set $x'_T = x$. Let C be an
internal cluster in the search with top boundary node v and with children A and B. If the depth
of v is d we stop the search. Otherwise, we proceed as follows.

1. If C is of type (a) or (b), $x'_C$ is in B, and the shared boundary node of A and B has depth
> d, we continue the search in A and set $x'_A$ to be the bottom boundary of A.

2. In all other cases, we continue the search in the child cluster containing $x'_C$, and compute the
new local preorder number for $x'_C$.

Note that if the shared boundary node in case 1 has depth d we continue the search in B. Combined 
with the assumption that i > 0, it inductively follows that y becomes the top boundary node at 
some cluster during the top-down search. Hence, at some cluster in the top-down search the depth 
of the top boundary node is d. 

Step 3: Bottom-up Search. Let C be the cluster whose top boundary node v has depth d
found in Step 2. We do a bottom-up search to compute the preorder number of v in T. Finally, we
report the result as y.
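
The three steps of LevelAncestor can be sketched as follows. The sketch is illustrative and rests on assumptions that are not part of the paper: the classes `Leaf` and `Merge`, the helpers `down`, `up`, and `depth`, the fields `l_b` and `dist`, and a hand-built top tree for a path of four nodes.

```python
class Leaf:
    """Leaf cluster: a single edge."""
    size = 2


class Merge:
    """Internal cluster; vertical=True models types (a)/(b) (assumption).
    l_b: local number of the shared boundary node, dist: its distance
    from the top boundary node (both used for vertical merges only)."""
    def __init__(self, a, b, vertical, l_b=1, dist=0):
        self.a, self.b, self.vertical = a, b, vertical
        self.l_b, self.dist = l_b, dist
        self.size = a.size + b.size - 1


def down(c, x):
    """Top-down step; the shared node of a vertical merge goes to B."""
    if c.vertical:
        r_b = c.l_b + c.b.size - 1
        if x < c.l_b:
            return "A", x
        if x <= r_b:
            return "B", x - c.l_b + 1
        return "A", x - r_b + c.l_b
    return ("A", x) if x <= c.a.size else ("B", x - c.a.size + 1)


def up(c, side, y):
    """Bottom-up step: number y in child `side` to the number in c."""
    if c.vertical:
        if side == "A":
            return y if y <= c.l_b else y + c.b.size - 1
        return y + c.l_b - 1
    return y if side == "A" or y == 1 else y + c.a.size - 1


def depth(root, x):
    c, d = root, 0
    while isinstance(c, Merge):
        side, x = down(c, x)
        if c.vertical and side == "B":
            d += c.dist
        c = c.a if side == "A" else c.b
    return d + (x - 1)


def level_ancestor(root, x, i):
    if i == 0:
        return x
    d = depth(root, x) - i                      # Step 1
    if d < 0:
        return None                             # no such ancestor
    c, top_depth, path = root, 0, []
    while top_depth != d:                       # Step 2
        side, nx = down(c, x)
        if c.vertical and side == "B" and top_depth + c.dist > d:
            side, nx = "A", c.l_b               # case 1: bottom boundary of A
        if c.vertical and side == "B":
            top_depth += c.dist
        path.append((c, side))
        c, x = (c.a if side == "A" else c.b), nx
    y = 1                                       # Step 3: top boundary of c
    for parent, side in reversed(path):
        y = up(parent, side, y)
    return y


# Path r - a - c - e with preorder numbers 1, 2, 3, 4.
upper = Merge(Leaf(), Leaf(), vertical=True, l_b=2, dist=1)     # r, a, c
chain = Merge(upper, Leaf(), vertical=True, l_b=3, dist=2)      # r, a, c, e
print([level_ancestor(chain, 4, i) for i in range(0, 5)])
```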

Nearest Common Ancestor. We compute NCA(x, y) in the following steps. We assume w.l.o.g.
that $x \neq y$ in the following since NCA(x, x) = x.

Step 1: Top-down Search We do a top-down search to find the first cluster whose top boundary
node is nca(x, y) (this cluster always exists since $x \neq y$). At each cluster C in the search we compute
local preorder numbers $x'_C$ and $y'_C$. The idea is that $x'_C$ and $y'_C$ are either x or y or ancestors of
x and y, and their depth is at least the depth of nca(x, y). Initially, for the root cluster T we set
$x'_T = x$ and $y'_T = y$. Let C be a cluster visited during the search. If C is a leaf cluster we stop the
search. Otherwise, C is an internal cluster with children A and B. We proceed as follows.

1. If $x'_C$ and $y'_C$ are in the same child cluster, we continue the search in that cluster, and compute
new local preorder numbers for $x'_C$ and $y'_C$.

2. If C is of type (a) or (b) and $x'_C$ and $y'_C$ are in different child clusters we continue the search
in A. We update the local preorder number of the node in B to be the bottom boundary of
A.

3. If C is of type (c), (d), or (e) and $x'_C$ and $y'_C$ are in different child clusters we stop the search.

Step 2: Bottom-up Search Let C be the cluster computed in step 1. We do a bottom-up
search to compute the preorder number of the top boundary node of C in the entire tree T, and
return the result.
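
The three cases of the top-down search and the final bottom-up search can be sketched as follows. This is an illustration under assumptions that do not come from the paper: the classes `Leaf` and `Merge`, the helpers `sides`, `to_child`, and `up`, the `vertical` flag standing for merge types (a)/(b), and a hand-built top tree for a five-node tree.

```python
class Leaf:
    """Leaf cluster: a single edge."""
    size = 2


class Merge:
    """Internal cluster; vertical=True models types (a)/(b) (assumption).
    l_b: local preorder number of the shared boundary node (vertical only)."""
    def __init__(self, a, b, vertical, l_b=1):
        self.a, self.b, self.vertical, self.l_b = a, b, vertical, l_b
        self.size = a.size + b.size - 1


def sides(c, x):
    """Set of child clusters ("A", "B") that contain local number x."""
    if c.vertical:
        if x == c.l_b:
            return {"A", "B"}
        r_b = c.l_b + c.b.size - 1
        return {"B"} if c.l_b < x <= r_b else {"A"}
    if x == 1:
        return {"A", "B"}
    return {"A"} if x <= c.a.size else {"B"}


def to_child(c, side, x):
    """Local number in child `side` of the node with number x in c."""
    if c.vertical:
        if side == "B":
            return x - c.l_b + 1
        return x if x <= c.l_b else x - c.b.size + 1
    return x if side == "A" or x == 1 else x - c.a.size + 1


def up(c, side, y):
    """Number in c of the node with number y in child `side`."""
    if c.vertical:
        if side == "A":
            return y if y <= c.l_b else y + c.b.size - 1
        return y + c.l_b - 1
    return y if side == "A" or y == 1 else y + c.a.size - 1


def nca(root, x, y):
    if x == y:
        return x
    c, path = root, []
    while isinstance(c, Merge):
        sx, sy = sides(c, x), sides(c, y)
        common = sx & sy
        if common:                      # case 1: same child cluster
            side = min(common)          # a single element, since x != y
        elif c.vertical:                # case 2: continue in A
            side = "A"
            if sx == {"B"}:
                x = c.l_b               # bottom boundary of A
            else:
                y = c.l_b
        else:                           # case 3: stop
            break
        path.append((c, side))
        x, y = to_child(c, side, x), to_child(c, side, y)
        c = c.a if side == "A" else c.b
    z = 1                               # top boundary node of c
    for parent, side in reversed(path):
        z = up(parent, side, z)
    return z


# Tree r(a(c), b(d)) with preorder numbers r=1, a=2, c=3, b=4, d=5.
left = Merge(Leaf(), Leaf(), vertical=True, l_b=2)      # r, a, c
right = Merge(Leaf(), Leaf(), vertical=True, l_b=2)     # r, b, d
tree = Merge(left, right, vertical=False)
print(nca(tree, 3, 5), nca(tree, 2, 3), nca(tree, 4, 5))
```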

4.2.3 Decompress, Height, Size, and Next Sibling 

To answer these queries, the key idea is to compute a small set of clusters representing T(x). This
set will be a subset of the set $S_x$ defined in Sec. 3.2 and will contain all the relevant information.

We need the following definitions. Let u be a node in T. We say that u is on the spine path in 
a cluster C if u is the top boundary node in C, or u is on the path from the top boundary node in 
C to the bottom boundary node in C. Since clusters are connected subtrees we immediately have 
the following. 

Lemma 9 Let $C = A \cup B$ be a cluster with left child A and right child B. A node u in T is on the
spine path of C iff one of the following cases is true:

• C is of type (c) and u is on the spine path in A.

• C is of type (d) and u is on the spine path in B.

• C is of type (a) and u is on the spine path in A or B.

• u is the top boundary node of C.
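
The four cases of Lemma 9 can be read as a recursive membership test. The short Python sketch below is only an illustration: the function name `on_spine_path` and its parameters are hypothetical, and the spine-path membership of u in the children A and B is assumed to be known already.

```python
def on_spine_path(merge_type, is_top_of_c, on_spine_a, on_spine_b):
    """Decide whether u is on the spine path of C = A ∪ B (Lemma 9).

    merge_type: one of "a", "b", "c", "d", "e" (type of the merge giving C).
    is_top_of_c: True iff u is the top boundary node of C.
    on_spine_a / on_spine_b: True iff u is on the spine path in A / in B.
    """
    if is_top_of_c:
        return True
    if merge_type == "a":
        return on_spine_a or on_spine_b
    if merge_type == "c":
        return on_spine_a
    if merge_type == "d":
        return on_spine_b
    return False        # types (b) and (e): only the top boundary node


print(on_spine_path("a", False, False, True))
print(on_spine_path("b", False, True, True))
```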



Let x be any internal node in T. As in Section 3.2 let L be the leftmost leaf cluster in $\mathcal{TD}$
such that x is the top boundary node and let P be the path of clusters from the smallest cluster U
containing all nodes of T(x) to L. We also define M to be the highest cluster on P that has x as the
top boundary node, i.e., M is the highest cluster on P that only contains edges from T(x). Recall
that $S_x$ is the set of $O(\log n_T)$ off-path clusters of P that represent T(x). We partition $S_x$ into the
set $S'_x$ that contains all clusters in $S_x$ that are descendants of M and the set $S''_x$ that contains the
remaining clusters. We characterize these sets as follows.

Lemma 10 Let B be an off-path cluster of P with parent C and sibling A. Then

1. B is in $S'_x$ iff B is a descendant of M.

2. B is in $S''_x$ iff C is a merge of type (a) or (b), B is the right child of C, and x is on the spine
path of A.

Proof. For the first property, first note that if B is in $S'_x$ it is by definition a descendant of M.
Conversely, if B is a descendant of M, we have that $E(B) \subseteq E(M) \subseteq E(T(x))$. By definition, we
have that B is in $S'_x$.

Next consider property 2. Suppose that B is in $S''_x$. Then, by Lemma 5 we have that $E(B) \subseteq E(T(x))$.
Furthermore, since C is a proper ancestor of M, C contains edges from both T(x) and
$T \setminus T(x)$, and therefore A must also contain edges from both T(x) and $T \setminus T(x)$.

Assume for contradiction that C is of type (c), (d), or (e). Then, the top boundary node v of C is
also the top boundary node in A and B. Since $x \neq v$, we have by Lemma 5 that $E(B) \cap E(T(x)) = \emptyset$
and thus B cannot be in $S''_x$.

Hence, assume that C is of type (a) or (b). Assume for contradiction that B is the left child of
C. Since all clusters on P contain E(L) and C contains edges from both T(x) and $T \setminus T(x)$, we
have that the bottom boundary node of B is a proper ancestor of x. Hence, B cannot be in $S''_x$.

Finally, if C is of type (a) or (b) and B is the right child of C, then $E(B) \subseteq E(T(x))$ iff the top
boundary node v of B is a descendant of x. But v is a descendant of x iff x is on the spine path of
A. Hence, B is in $S''_x$ iff x is on the spine path of A.

In the following we show how to efficiently compute $S_x$ using the procedure FindRepresentatives.
We then use FindRepresentatives to implement the remaining procedures.

FindRepresentatives Procedure FindRepresentatives(x) computes the set $S_x$ in two steps.

Step 1: Top-down Search We do a top-down search to find the cluster M, i.e., the highest
cluster on P that has x as the top boundary node. If no such cluster exists, then x is a leaf node in
T.

Step 2: Bottom-up Search We do a bottom-up search from M and add clusters according to
Lemma 10 as follows. Initially, set $S = \emptyset$. Let A be a cluster on the path with sibling B and parent
C.

1. If C is of type (a) or (b) and A is the left child of C, add B to S.

2. If one of the following conditions is true, stop the traversal:

• C is of type (c) and A is the right child of C.

• C is of type (d) and A is the left child of C.

• C is of type (e) or (b).

Note that, as long as we continue the bottom-up search and consider clusters on the path, we have
that x is on the spine path of these clusters. This is because we continue the bottom-up search
according to the cases of Lemma 9. It follows from Lemma 10 that the clusters we add to S are
exactly the clusters in the set representing T(x). The total time is $O(\log n_T)$.
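
Step 2 of FindRepresentatives can be sketched as a single loop over the path from M towards the root. The Python sketch below is illustrative only: the function name `collect_representatives` and the encoding of each step as a triple (merge type of the parent C, whether the current cluster A is the left child of C, the sibling B) are assumptions made for the example.

```python
def collect_representatives(path_up):
    """Bottom-up search of Step 2.

    path_up lists, from M upwards, one triple per visited cluster A:
    (merge_type, a_is_left, sibling), where merge_type in "abcde" is the
    type of the parent C and sibling is the off-path cluster B.
    """
    s = []
    for merge_type, a_is_left, sibling in path_up:
        # Rule 1: vertical merge with A on top, so B hangs below the spine.
        if merge_type in ("a", "b") and a_is_left:
            s.append(sibling)
        # Rule 2: x leaves the spine path of C, so the traversal stops.
        if ((merge_type == "c" and not a_is_left)
                or (merge_type == "d" and a_is_left)
                or merge_type in ("e", "b")):
            break
    return s


# Hypothetical path with four ancestors of M; siblings are named B1..B4.
path = [("a", True, "B1"), ("c", True, "B2"), ("b", True, "B3"), ("a", True, "B4")]
print(collect_representatives(path))
```

In this hypothetical path the traversal adds B1, skips B2, adds B3, and then stops because the third merge is of type (b).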

Decompress To compute Decompress(x), we use FindRepresentatives(x) to compute the set of
clusters $S_x$. We construct T(x) from $S_x$ and the path P computed during the traversal of $\mathcal{TD}$. First,
we decompress all clusters in $S_x$ by unfolding their subDAG and constructing their corresponding
subtree of T. We then combine these subtrees using the merge information stored for each cluster
in P.

In total we use $O(\log n_T)$ time for FindRepresentatives(x) and computing the path P. The total
time to decompress a cluster of $\mathcal{TD}$ by unfolding is linear in its size. Hence, the total time used is
$O(\log n_T + |T(x)|)$.

Height First we compute the set of clusters $S_x$ using FindRepresentatives(x). We compute the
height of T(x) as the sum of the local heights of all clusters in $S_x$. This correctly computes the
height since all clusters in $S_x$ are merged with their siblings by type (a) or (b). Since the local
height for each cluster in $\mathcal{TD}$ is stored we use $O(\log n_T)$ time in total.

Size Similar to height. We sum the sizes of clusters in $S_x$ and subtract $|S_x| - 1$. This also uses
$O(\log n_T)$ time.

Nextsibling We compute NextSibling(x) directly from Size(x) since NextSibling(x) = x + Size(x).
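
The arithmetic behind Size and NextSibling is small enough to state as code. The sketch below is an illustration with hypothetical function names; it assumes the sizes of the representative clusters are already available, and it reads the correction term $|S_x| - 1$ as accounting for boundary nodes counted twice, which is an interpretation rather than a statement from the text.

```python
def subtree_size(cluster_sizes):
    """Size(x): sum of the cluster sizes minus |S_x| - 1.

    cluster_sizes: node counts of the clusters representing T(x)
    (assumed non-empty).
    """
    return sum(cluster_sizes) - (len(cluster_sizes) - 1)


def next_sibling(x, size_of_x):
    """NextSibling(x) = x + Size(x), in preorder numbers of T.

    Assumption: x actually has a next sibling; otherwise the returned
    number is simply the first preorder number after T(x).
    """
    return x + size_of_x


# Tree r(a(c), b) with preorder numbers r=1, a=2, c=3, b=4:
# T(a) is represented by the single edge cluster (a, c) of size 2.
size_a = subtree_size([2])
print(size_a, next_sibling(2, size_a))
```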

5 Conclusion and Open Problems 

We have presented the new top tree compression scheme, and shown that it achieves close to optimal 
worst-case compression, can compress exponentially better than DAG compression, is never much 
worse than DAG compression, and supports navigational queries in logarithmic time. We conclude 
with some open problems. 

• Surprisingly, top tree compression is the first compression scheme for trees that achieves any
provable non-trivial compression guarantee compared to the classical DAG compression. We
wonder how other tree compression schemes compare to DAG compression and if it is possible
to construct a tree compression scheme that exploits tree pattern repeats and compresses
better than a logarithmic factor of the DAG compression.

• Pattern matching in compressed strings is a well-studied and well-developed area with numerous
results, see e.g., the surveys [14, 17, 22]. Pattern matching in compressed trees (especially
within tree compression schemes that exploit tree pattern repeats) is a wide open area.

• We wonder if top tree compression is practical. In preliminary experiments we have compared 
our top DAG compression with standard DAG compression on typical XML datasets that 
were previously used in papers on DAG compression. The experiments match our theoretical 
expectations, i.e., that balanced trees compress slightly better with standard DAG than with 
top tree compression, but either shallow trees or deep trees compress better with top tree 
compression. 